
Conversation

@mjosaarinen
Contributor

@mjosaarinen mjosaarinen commented May 19, 2025

Summary:
RV64V support (RISC-V vector extension 1.0, which is available on newer application-class silicon).

Do you expect this change to impact performance: Yes/No
Yes (RISC-V only).

If yes, please provide local benchmarking results.
Roughly 2.5x performance on silicon with vector hardware.

@mjosaarinen mjosaarinen requested a review from a team as a code owner May 19, 2025 14:57
@mjosaarinen mjosaarinen force-pushed the rv64v-dev branch 3 times, most recently from 0637941 to 6b4f845 Compare May 19, 2025 18:35
@hanno-becker
Contributor

@mjosaarinen If you have nix set up, running autogen should hopefully resolve the linting issues.

@mjosaarinen mjosaarinen force-pushed the rv64v-dev branch 2 times, most recently from dcab861 to 38e79bb Compare May 19, 2025 20:40
@rod-chapman
Contributor

Note that on the "fastntt3" branch, there are layer-merged implementations of the NTT and INTT that are highly amenable to auto-vectorization with compilers like GCC 14. Benchmarks of that code on an RV64v target were encouraging, so might provide some inspiration for a fully vectorized, hand-written back-end.

@mjosaarinen
Contributor Author

Note that on the "fastntt3" branch, there are layer-merged implementations of the NTT and INTT that are highly amenable to auto-vectorization with compilers like GCC 14. Benchmarks of that code on an RV64v target were encouraging, so might provide some inspiration for a fully vectorized, hand-written back-end.

Yeah, you can easily double the speed with autovectorization alone, and some Google folks were of the opinion that they wanted to rely on that entirely in BoringSSL (RISC-V Android etc.), rather than maintain a hand-optimized version. The resulting code is pretty wild; I looked at it when considering RISC-V ISA extensions (see slide 17, for example, in https://mjos.fi/doc/20240325-rwc-riscv.pdf). It was almost "too good" -- I suspect that Google has used those NTTs as a microbenchmark when developing LLVM autovectorizers :)

@mjosaarinen
Contributor Author

@mjosaarinen If you have nix set up, running autogen should hopefully resolve the linting issues.

Yeah, sorry for abusing your CI like that (I wasn't expecting it to be that extensive); I could have just read the documentation. I'll set up this nix thing.

@hanno-becker
Contributor

hanno-becker commented May 20, 2025

@mjosaarinen Sorry, we should have pointed that out earlier. With the nix environment, you should not need to waste any more time making the linter happy. Just run format && autogen before pushing.

@hanno-becker
Contributor

hanno-becker commented May 27, 2025

@mkannwischer @mjosaarinen The RV functional tests in the ci-cross shell don't seem to pass.

% nix develop --extra-experimental-features 'nix-command flakes' .#ci-cross
% CROSS_PREFIX=riscv64-unknown-linux-gnu- make func OPT=1 AUTO=1 -j32
% EXEC_WRAPPER=qemu-riscv64 make run_func_512 -j32 OPT=1 AUTO=1
qemu-riscv64 test/build/mlkem512/bin/test_mlkem512
ERROR (test/test_mlkem.c,41)
ERROR (test/test_mlkem.c,225)
make: *** [Makefile:53: run_func_512] Error 1

Same for OPT=0.

Could you investigate?

Contributor

@hanno-becker hanno-becker left a comment


The RV64 cross tests in CI are failing. I don't know if this is an issue with the code or the CI setup, but it needs looking into.

@mjosaarinen
Contributor Author

The RV64 cross tests in CI are failing. I don't know if this is an issue with the code or the CI setup, but it needs looking into.

It looks like some things are running fine, and some things have build problems. I suspect that the platform auto-detection code (in the C header files + makefiles) is to blame (does failing no_opt mean "no optimization"?). It's quite a complicated CI -- can you tell me what would be the best way to debug this stage of the CI build process locally?

@hanno-becker
Contributor

hanno-becker commented May 27, 2025

It's quite a complicated CI -- can you tell me what would be the best way to debug this stage of the CI build process locally?

Sure. If you have nix installed, then it's exactly what I posted above: You open the nix shell with

% nix develop --extra-experimental-features 'nix-command flakes' .#ci-cross

Building using cross compiler:

% CROSS_PREFIX=riscv64-unknown-linux-gnu- make func AUTO=1 

And then running:

% qemu-riscv64 test/build/mlkem512/bin/test_mlkem512
% qemu-riscv64 test/build/mlkem768/bin/test_mlkem768
% qemu-riscv64 test/build/mlkem1024/bin/test_mlkem1024

The base command line used by AUTO=1 above is:

riscv64-unknown-linux-gnu-gcc -Wall -Wextra -Werror=unused-result -Wpedantic -Werror 
  -Wmissing-prototypes -Wshadow -Wpointer-arith -Wredundant-decls -Wno-long-long 
  -Wno-unknown-pragmas -Wno-unused-command-line-argument -O3 -fomit-frame-pointer 
  -std=c99 -pedantic -MMD  -march=rv64gcv_zvl256b 
  -DMLK_FORCE_RISCV64 -DMLK_CONFIG_USE_NATIVE_BACKEND_ARITH 
  -DMLK_CONFIG_USE_NATIVE_BACKEND_FIPS202 -DMLK_CONFIG_PARAMETER_SET=1024 

What is confusing to me is that this happens regardless of whether OPT=0 or OPT=1 (it defaults to OPT=0). If OPT=0, the new files aren't even compiled.

If you don't want to be bothered by the auto-settings, we can for now set AUTO=0 and manually specify everything on the command line using CC and CFLAGS:

% CFLAGS="-march=rv64gcv_zvl256b " CC=riscv64-unknown-linux-gnu-gcc make func AUTO=0
...
% qemu-riscv64 test/build/mlkem1024/bin/test_mlkem1024
ERROR (test/test_mlkem.c,49)
ERROR (test/test_mlkem.c,225)

@hanno-becker
Contributor

hanno-becker commented May 27, 2025

@mkannwischer @mjosaarinen

It looks like the issue is solely due to -march=rv64gcv_zvl256b. Even on main, if you do:

CFLAGS="-march=rv64gcv_zvl256b" CC=riscv64-unknown-linux-gnu-gcc make func -j32 AUTO=0 OPT=0
qemu-riscv64 test/build/mlkem512/bin/test_mlkem512

it'll fail with

ERROR (test/test_mlkem.c,41)
ERROR (test/test_mlkem.c,225)

Probably some configuration is missing in the invocation of qemu-riscv64 -- nowhere do we say that we're emulating a system with the 256-bit vector extension.

@hanno-becker
Contributor

qemu-riscv64 -cpu rv64,v=true,vlen=256 works. @mjosaarinen is that the right config?

@mjosaarinen
Contributor Author

qemu-riscv64 -cpu rv64,v=true,vlen=256 works. @mjosaarinen is that the right config?

Hmm. I'm not sure if qemu is able to pull in the standard runtime and dynamic libraries that match the "vector ABI" (the calling convention changes somewhat with vector registers). I had to link the test programs with -static when I was testing the code. I was testing with spike and vector silicon myself; sorry, I have not yet looked at the qemu environment used by CI.

Anyway, the code assumes the "vector 1.0" extension, 256-bit VLEN, and the standard "gc" stuff, so that looks correct. I left the Keccak instruction stuff out.

@hanno-becker
Contributor

I adjusted the CI. Most tests pass, but there are some issues in the ACVP tests and the monobuild examples. The ACVP one seems to be an issue in the test script (not expecting parameters in the exec-wrapper), while the monobuild failure seems to be an architecture confusion. I'll take a look.

@hanno-becker hanno-becker mentioned this pull request May 28, 2025
@hanno-becker hanno-becker force-pushed the rv64v-dev branch 3 times, most recently from 951ab19 to 0c4d7e8 Compare May 28, 2025 12:19
@hanno-becker hanno-becker added the benchmark this PR should be benchmarked in CI label Oct 18, 2025

@oqs-bot oqs-bot left a comment


⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Intel Xeon 3rd gen (c6i)'.
The benchmark result of this commit is worse than the previous result, exceeding the threshold of 1.03.

Benchmark suite       Current: 01bab43    Previous: 5aaca94    Ratio
ML-KEM-768 encaps     31969 cycles        30015 cycles         1.07

This comment was automatically generated by workflow using github-action-benchmark.

@hanno-becker
Contributor

Use a mulcache. This should be completely straightforward and should simplify and accelerate the basemul.

I was wrong. I tried out what seemed like the straightforward mulcache implementation to me:

void mlk_rv64v_poly_mulcache_compute(int16_t x[MLKEM_N / 2],
                                     const int16_t y[MLKEM_N])
{
#include "rv64v_zetas_basemul.inc"

  size_t vl = __riscv_vsetvl_e16m1(MLKEM_N) / 2;
  size_t i;

  for (i = 0; i < MLKEM_N; i += 2 * vl)
  {
    vint16m1_t b1 = 
        __riscv_vget_v_i16m1x2_i16m1(__riscv_vlseg2e16_v_i16m1x2(&y[i], vl), 1);
    vint16m1_t z = __riscv_vle16_v_i16m1(&roots[i / 2], vl);
    vint16m1_t bt = fq_mul_vv(b1, z, vl);
    __riscv_vse16_v_i16m1(&x[i / 2], bt, vl);
  }
}

static inline void mlk_rv64v_poly_basemul_mont_add_k(int16_t *r,
                                                     const int16_t *a,
                                                     const int16_t *b,
                                                     const int16_t *bc,
                                                     unsigned kn)
{
  size_t vl = __riscv_vsetvl_e16m1(MLKEM_N) / 2;
  size_t i, j;

  for (i = 0; i < MLKEM_N; i += 2 * vl)
  {
    vint32m2_t acc0 = __riscv_vmv_v_x_i32m2(0, vl);
    vint32m2_t acc1 = __riscv_vmv_v_x_i32m2(0, vl);

    for (j = 0; j < kn; j += MLKEM_N)
    {
      vint16m1x2_t x = __riscv_vlseg2e16_v_i16m1x2(&a[i + j], vl);
      vint16m1x2_t y = __riscv_vlseg2e16_v_i16m1x2(&b[i + j], vl);

      vint16m1_t x0 = __riscv_vget_v_i16m1x2_i16m1(x, 0);
      vint16m1_t x1 = __riscv_vget_v_i16m1x2_i16m1(x, 1);
      vint16m1_t y0 = __riscv_vget_v_i16m1x2_i16m1(y, 0);
      vint16m1_t y1 = __riscv_vget_v_i16m1x2_i16m1(y, 1);
      vint16m1_t yt = __riscv_vle16_v_i16m1(&bc[(i + j) / 2], vl);

      vint32m2_t x0y0 = __riscv_vwmul_vv_i32m2(x0, y0, vl);
      vint32m2_t x0y1 = __riscv_vwmul_vv_i32m2(x0, y1, vl);
      vint32m2_t x1y0 = __riscv_vwmul_vv_i32m2(x1, y0, vl);
      vint32m2_t x1yt = __riscv_vwmul_vv_i32m2(x1, yt, vl);

      acc0 = __riscv_vadd_vv_i32m2(acc0, x0y0, vl);
      acc0 = __riscv_vadd_vv_i32m2(acc0, x1yt, vl);
      acc1 = __riscv_vadd_vv_i32m2(acc1, x0y1, vl);
      acc1 = __riscv_vadd_vv_i32m2(acc1, x1y0, vl);
    }

    __riscv_vsseg2e16_v_i16m1x2(
        &r[i],
        __riscv_vcreate_v_i16m1x2(fq_redc2(acc0, vl), fq_redc2(acc1, vl)), vl);
  }
}

But this is notably slower. Interestingly, even ignoring the mulcache computation itself (which, oddly, is almost twice as slow as tomont), the basemul is slower than @mjosaarinen's, despite using fewer multiplications and additions.

@mjosaarinen What can I learn from this? My guess is that vlseg2 and vsseg2 are very slow operations, but that would be a bit surprising, because ultimately they are a combination of gather and load/store, and in your code there's heavy shuffling as well. Anyway, any intuition you can share about what works and what doesn't is appreciated.
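For reference, the mulcache idea in scalar form: cache b_odd * zeta (Montgomery-reduced) once per coefficient pair, so the degree-1 basemul no longer multiplies by zeta on the fly. A minimal scalar sketch with the standard ML-KEM q = 3329 Montgomery constants -- the helper names are illustrative, not this repository's actual code, and narrowing casts assume two's-complement truncation as in the reference implementations:

```c
#include <stdint.h>

#define MLKEM_Q 3329
#define QINV -3327 /* q^-1 mod 2^16, as a signed 16-bit value */

/* Montgomery reduction: returns a * 2^-16 mod q, with |result| < q
 * for |a| < q * 2^15 (reference-style implementation). */
static int16_t mont_reduce(int32_t a)
{
  int16_t t = (int16_t)a * QINV;
  return (int16_t)((a - (int32_t)t * MLKEM_Q) >> 16);
}

/* Scalar model of mulcache computation: one cached value per pair.
 * zetas holds the 128 basemul twiddles in Montgomery form. */
static void poly_mulcache_model(int16_t bc[128], const int16_t b[256],
                                const int16_t zetas[128])
{
  for (int i = 0; i < 128; i++)
  {
    bc[i] = mont_reduce((int32_t)b[2 * i + 1] * zetas[i]);
  }
}

/* One degree-1 basemul (a0 + a1*X)(b0 + b1*X) mod (X^2 - zeta), using
 * the cached bc = b1 * zeta * 2^-16 in place of the zeta multiply. */
static void basemul_cached_model(int16_t r[2], const int16_t a[2],
                                 const int16_t b[2], int16_t bc)
{
  r[0] = mont_reduce((int32_t)a[0] * b[0] + (int32_t)a[1] * bc);
  r[1] = mont_reduce((int32_t)a[0] * b[1] + (int32_t)a[1] * b[0]);
}
```

The payoff in the AArch64/x86_64 backends comes from reuse: the second operand's odd coefficients are pre-twisted once and then shared across all k basemuls of a matrix-vector product.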

Contributor

@mkannwischer mkannwischer left a comment


Three minor comments/questions, but none of them is blocking.
Happy to go ahead and merge this PR and revisit it in a follow-up if needed.

Thank you @mjosaarinen for contributing and thank you @hanno-becker for all the work you have put into this!

@hanno-becker hanno-becker force-pushed the rv64v-dev branch 3 times, most recently from 4514c4c to ecb4cef Compare October 18, 2025 18:06
mjosaarinen and others added 16 commits October 18, 2025 19:20
…sics.)

Signed-off-by: Markku-Juhani O. Saarinen <[email protected]>
Signed-off-by: Matthias J. Kannwischer <[email protected]>
Signed-off-by: Hanno Becker <[email protected]>
- Don't skip over 'opt' tests in CI when testing RV64
- Configure qemu to use 256-bit VL.

Once we support other VLs, the invocation of qemu needs
to be generalized accordingly.

Signed-off-by: Hanno Becker <[email protected]>
Signed-off-by: Matthias J. Kannwischer <[email protected]>
Signed-off-by: Hanno Becker <[email protected]>
This commit makes minor changes to the RV64 backend to make the
non-NTT components VLEN-agnostic. This mostly means using
__riscv_vsetvl_e16m1 instead of a hardcoded VLEN, plus some added
tail handling in rejection sampling.

For the NTT and invNTT, the implementation is specific to VLEN=256,
and falls back to the C implementation for other vector lengths.
This should be improved in the future.

The compile-time guard for the RV64 backend is changed to check
for RVV support only, without a specific VLEN. Similarly, the default
build flags use 128 rather than 256 as the minimum VLEN.

The README is adjusted accordingly. We also remove the warning
regarding the code being experimental, as it has by now undergone
thorough review and testing.

Signed-off-by: Hanno Becker <[email protected]>
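The stripmining pattern this commit relies on can be modeled in portable C. Here `vsetvl_model` is a hypothetical stand-in for `__riscv_vsetvl_e16m1`, which on hardware grants at most the machine's maximum VL per iteration; the point is that the loop never hardcodes a vector length, so the same code covers any VLEN, tail included (a sketch, not the backend's code):

```c
#include <stddef.h>
#include <stdint.h>

#define VLMAX 16 /* made-up stand-in for the hardware's e16/m1 max VL */

/* Model of the vsetvl contract: grant at most VLMAX elements. */
static size_t vsetvl_model(size_t avl)
{
  return avl < VLMAX ? avl : VLMAX;
}

/* VLEN-agnostic elementwise add: the loop steps by whatever vl the
 * "hardware" grants, so the tail needs no special-casing. */
static void poly_add_model(int16_t *r, const int16_t *a, const int16_t *b,
                           size_t n)
{
  size_t i = 0;
  while (i < n)
  {
    size_t vl = vsetvl_model(n - i);
    for (size_t j = 0; j < vl; j++) /* models one vector instruction */
    {
      r[i + j] = (int16_t)(a[i + j] + b[i + j]);
    }
    i += vl;
  }
}
```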
The backend API requires a bound of 8 * MLKEM_Q on the absolute
value of the output coefficients of the NTT. This bound is already
satisfied by the current implementation (which is aligned with the
AArch64 and x86_64 backends) and hence no final reduction is needed.

Signed-off-by: Hanno Becker <[email protected]>
A plain poly add/sub is not part of the backend API because we
trust autovectorization to handle it. The corresponding functions
mlk_rv64v_poly_add and mlk_rv64v_poly_sub are therefore dead code
and removed from the backend.

Signed-off-by: Hanno Becker <[email protected]>
Use high subtraction rather than addition in the Montgomery
reduction to avoid the possibility of a carry-in from the
low half. Cf. https://eprint.iacr.org/2018/039.

Signed-off-by: Hanno Becker <[email protected]>
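In scalar form, the trick is that a*b ≡ m*q (mod 2^16) by construction, so the low halves cancel exactly and the reduction becomes a single subtraction of high halves with no carry propagation -- which maps directly onto vector mullo/mulhi instructions. A sketch assuming two's-complement wraparound on narrowing casts and arithmetic right shift (the targets in question behave this way; this is not the backend's literal code):

```c
#include <stdint.h>

#define MLKEM_Q 3329
#define QINV -3327 /* q^-1 mod 2^16, as a signed 16-bit value */

/* Montgomery multiplication via high-half subtraction:
 * returns a * b * 2^-16 mod q, with |result| < q for |a|, |b| < q. */
static int16_t fq_mul_hisub(int16_t a, int16_t b)
{
  int32_t ab = (int32_t)a * b;
  int16_t lo = (int16_t)ab;         /* mullo(a, b) */
  int16_t hi = (int16_t)(ab >> 16); /* mulhi(a, b) */
  int16_t m = (int16_t)(lo * QINV); /* m = lo * q^-1 mod 2^16 */
  int16_t mq_hi = (int16_t)(((int32_t)m * MLKEM_Q) >> 16); /* mulhi(m, q) */
  /* ab ≡ m*q (mod 2^16), so the low halves cancel exactly and no
   * carry from the low half can occur: */
  return (int16_t)(hi - mq_hi);
}
```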
Previously, the invNTT would keep the coefficients in unsigned
canonical range [0,MLKEM_Q).

This commit changes this to a lazy reduction strategy following
the AArch64 and x86_64 backends. Bounds are tracked coarsely and
reductions introduced where necessary.

The reduction is certainly not optimal, but the returns of further
improvements are diminishing while complicating review.

Bounds assertions (debug only) are introduced to check the bounds
at runtime.

Signed-off-by: Hanno Becker <[email protected]>
For VLEN >= 512, there are tail iterations in the rejection
sampling loop where we require fewer coefficients than fit into
a vector, requiring an adjustment of the dynamic VL.

The previous code re-evaluated the dynamic VL in every iteration,
which incurred a significant runtime cost. This commit instead splits
the rejection sampling loop into two nested loops, where the inner loop
proceeds with a fixed VL and the outer loop re-evaluates the VL.
For VLEN <= 256, there is only one iteration of the outer loop, rendering
it as efficient as the original version.

Signed-off-by: Hanno Becker <[email protected]>
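The two-level structure can be modeled in scalar C: the outer loop re-evaluates the step size, the inner loop keeps it fixed, and only the tail forces further outer iterations. A behavioural sketch with made-up names and a made-up VLMAX, not the backend's code:

```c
#include <stddef.h>
#include <stdint.h>

#define MLKEM_Q 3329
#define VLMAX 32 /* made-up stand-in for candidates per "vector" step */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Model of the split rejection sampling loop: the outer loop
 * re-evaluates the step size vl; the inner loop keeps it fixed, so the
 * vl computation stays out of the hot path. Keeps each candidate iff
 * it is < q; returns the number of accepted coefficients. */
static unsigned rej_uniform_model(int16_t *out, unsigned need,
                                  const uint16_t *cand, unsigned ncand)
{
  unsigned count = 0, pos = 0;
  while (count < need && pos < ncand)
  {
    size_t vl = min_sz(VLMAX, need - count);  /* outer: re-evaluate vl */
    while (need - count >= vl && pos < ncand) /* inner: vl is fixed */
    {
      unsigned got = 0;
      for (size_t j = 0; j < vl && pos < ncand; j++, pos++)
      {
        if (cand[pos] < MLKEM_Q)
        {
          out[count + got++] = (int16_t)cand[pos];
        }
      }
      count += got;
    }
  }
  return count;
}
```

Because the acceptance count per step varies, the inner loop exits as soon as fewer than vl coefficients remain, and the outer loop shrinks vl accordingly.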
This is required by `-Wconversion`, and is indeed worth a comment
since a cast from `unsigned` to `int` can in theory overflow.

Also, unify unsigned types used in `rej_uniform` to `unsigned`,
and add explicit casts where the RVV intrinsics return `unsigned long`.

Signed-off-by: Hanno Becker <[email protected]>
@hanno-becker hanno-becker merged commit 833fc0a into main Oct 18, 2025
379 checks passed
@hanno-becker hanno-becker deleted the rv64v-dev branch October 18, 2025 18:57
@hanno-becker
Contributor

hanno-becker commented Oct 18, 2025

Many thanks @mjosaarinen, this is a great addition.

When do you think you'll get to writing a Keccak backend, and adding [inv]NTTs for other VLENs?
